| ABO | Rh Factor | Probability |
|---|---|---|
| A | - | 0.005 |
| A | + | 0.270 |
| AB | - | 0.001 |
| AB | + | 0.070 |
| B | - | 0.004 |
| B | + | 0.250 |
| O | - | 0.010 |
| O | + | 0.390 |
Transmission of Genetic Information
Introduction
Motivation and Goals
Many topics within the study of genetics have a strong link to numerical or quantitative analysis. Long before the identification of DNA as the primary carrier of genetic information, both scientists and non-scientists were conducting breeding experiments. Initially these “experiments” had the goal of domesticating animals for human benefit (e.g., cattle, pigs, goats, dogs). Although not true experiments as we know them today, the end result of those efforts played a large part in the rise of agriculture and civilizations.
At least ten thousand years passed before the investigations of predictable breeding phenotypes in pea plants by Gregor Mendel (1822-1884), which ultimately led to modern genetics. The study of genetics during the first half of the 20th century was carried out without an understanding of the structure of DNA. Even without this specific knowledge, scientists were able to make great strides in understanding the patterns of evolution in populations. Much of this research was aided by simultaneous advances in statistics during the same time period, which in many cases were being developed to help make sense of the vast array of data that was being collected.
We have developed these exercises to give you an introduction to some of the statistical approaches that are used in the study of genetics. We hope that you will use these to further develop your understanding of topics in genetics that you are covering in lecture and that you will begin to gain some intuition about the methods used to test hypotheses. Some of you might even wish to go deeper into understanding the math and computer code behind these exercises.
Because quantitative aspects of genetics necessarily involve numbers and equations, we will need to use them. By working through examples, we hope to demystify the numbers. Complex math is not usually necessary, and most can be accomplished with only addition, subtraction, multiplication, and division.
Finally, we will use computer code to carry out analyses and generate plots. These exercises are not designed to be a course in coding, and you do not need to learn any additional material to use the exercises. You will be able to work through just by clicking the “Run Code” button, entering or changing a few values, and clicking “Run Code” again. As mentioned above, if you are interested in the computational side, we encourage you to edit, explore, and learn all you can.¹ But that is not necessary to use these exercises for your benefit.
How to use these exercises
This web page (https://biosc-2200.github.io/TGI/) is refreshed each time you visit. It is available any time, both on and off campus. It should look the same on laptops, tablets, and phones, although the text will be necessarily small on your phone.
The first and most important directive is to read carefully, because there is no need to rush through. The exercises are designed to be worked linearly. Each piece builds on those that come before it. Work through deliberately, and when you get to a code chunk (see below), pay attention to any edits or data that you need to input.
There are prompts throughout that ask you to think, make a prediction, etc. It might be useful to save these responses to a separate document or save/print this page when you are done. Unfortunately, the page itself has no way to save your work.
These prompts look like this:
Take a few moments to think about what you hope to gain from these exercises.
R and webR
These exercises use the statistical programming language R for analysis and plotting. You don’t need to already know any coding in R or any other language. Everything you need to know will be provided.
On this web page, R is running from within your web browser using the webR framework, which means that you do not need to install any software (nor is any software permanently installed on your device).
When you first load the web page, it will take a few seconds for the software to load. When that is complete, you will see a status icon at the top of the page. Once you see the green “Ready!” sign, everything is set.
Code blocks
As you work through, you will encounter code blocks. Some code blocks will run when the page is first loaded (no interaction required on your part). Other code blocks won’t execute until you ask.
These code blocks will look like this, with a “Run Code” button:
Clicking the “Run Code” button will cause R to run the code and print the result (in this case 4 – not very interesting, but we have to start somewhere). The simplest example is to just use R as a fancy calculator. If you haven’t already, click “Run Code” above.
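For reference, a calculator-style chunk like the one described is nothing more than a bare arithmetic expression:

```r
# R evaluates the expression and prints the result: 4
2 + 2
```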
Feel free to try different values and different operators (+, -, *, /, ^). You can’t break anything. The worst that will happen is that you will get a syntax error. Execute the code below to see what a syntax error looks like.
You can edit the block to make the code execute successfully, or just leave it and move on.
Learning Objectives
The learning objectives for this exercise are:
- Calculate the probability of a particular gamete being produced from an individual, assuming independent assortment.
- Calculate the probability of a particular genotype in the offspring, given independent assortment and random fertilization between two individuals.
- Design genetic crosses to provide information about genes, alleles, and gene functions.
- Use a chi-squared test to determine how well data from a genetic cross fits theoretical predictions.
- Explain how sample size is related to the certainty of a chi-squared test.
Review of Inheritance
When gametes are produced, each gamete receives a single copy of each gene. Because each of the F1 parents has two copies, there are four allele copies that can be passed on (for example, two copies of D and two copies of d in Figure 1). A Punnett square can be used to enumerate the possible genotype combinations that result from the proposed cross. By joining the genes from each parent, the resulting possible offspring genotypes can be determined. These genotypes can be converted into phenotypes if the nature of dominance of the alleles is known (Figure 1). In this example the D allele for the “tall” phenotype is dominant relative to the d allele.
The 1:2:1 genotypic ratio and 3:1 phenotypic ratio in the heterozygote cross example above represent theoretical probabilities for the distributions of genotypes and phenotypes. These ratios can be tested statistically to determine if a sample deviates significantly from the expected proportions. The exercises that follow will allow you to explore the statistics of genetics and give you some practice testing whether observations follow the predicted patterns.
Proportions and Probabilities
When thinking about genotypes and phenotypes, it can be useful to sometimes think of counts as proportions and sometimes as probabilities. The 1:2:1 genotypic ratio above can be thought of as
- Proportions: 1/4 DD, 2/4 Dd, and 1/4 dd
- Percentages: 25% DD, 50% Dd, and 25% dd
- Probabilities: 0.25 DD, 0.5 Dd, and 0.25 dd
All are equal and mean the same thing. In some contexts, using one vs. another makes more sense. This set of proportions and probabilities all sum to one and represents the full range of possible combinations.
Probabilities are very predictable in the long run but are often unpredictable in the short run. Consider only one gamete being produced, say a sperm. Each sperm that is produced from a heterozygote Dd can carry either the D or d allele.
We can simulate the production of a single sperm with the following code. First we create an object that holds the possible alleles: “D” and “d”. The code sample(allele, size = 1) randomly samples 1 of the possible alleles. Run this code a few times to see the first few gametes produced.
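Put together, that sampling step looks like this:

```r
# The two possible alleles a Dd heterozygote can contribute
allele <- c("D", "d")

# Randomly draw one allele -- a single gamete; rerun for a new draw
sample(allele, size = 1)
```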
You probably noticed that runs of two or more D or two or more d in a row were often produced, purely by chance.
The code block below repeats this sampling process 10 times and counts up the numbers of D and d gametes produced. The function replicate() generates the samples and table() counts Ds and ds.
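In code, that repeated sampling looks like:

```r
allele <- c("D", "d")

# Produce 10 gametes, then count how many carry D and how many carry d
gametes <- replicate(10, sample(allele, size = 1))
table(gametes)
```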
How many D and d gametes do you predict from a total of 10? How confident are you in your answer?
Run the code a few times to generate new samples of 10 gametes. Revisit your prediction.
Now gradually increase the number of gametes produced by changing n = 10 to n = 50, n = 100, etc.
What happens to the relative counts of D and d as n increases? What does this tell you about our ability to predict the exact counts when sample sizes are small compared to when they are large? In general, how often do you see exactly 50% D and 50% d?
Blood types in humans
To explore the concepts of independent assortment and probabilities, we will use the blood-typing system that is used to classify human blood types based on the expression of different antigens on the surface of red blood cells (Figure 2). Blood types are important for transfusion medicine, where the mixing of incompatible blood types can have fatal results.
Blood typing in humans primarily uses the ABO system, although almost 30 other blood typing systems exist, which focus on other aspects of red blood cell structure and physiology.
A and/or B refer to the presence of A and/or B antigens on the surface of the red blood cells. O denotes the absence of both A and B antigens. The American Red Cross has a website with animations about what blood types are compatible with other blood types. In general, A must be matched with A and B with B or a cross-reaction will take place. Because type O blood lacks both A and B antigens, it is considered a “universal donor”. And because type AB blood has both A and B antigens, it is considered a “universal recipient”.
Figure 3 shows the Punnett square for human blood types. Each parent can provide alleles for A, B, or no antigens (O), leading to a complex 1:2:1:1:1:2:1 ratio.
The right side of Figure 3 shows the blood types (i.e., phenotypes) associated with each of the possible genotypes. Although there are 7 possible genotypes, there are only 4 possible blood types: A, B, AB, and O.
Looking at the figure, you may have wondered (1) why there is no type AO or BO blood and (2) how type AB blood is produced.
- Because the O allele has no associated antigen, an individual with A+O alleles will have type A blood and an individual with B+O will have type B blood (the same is true for O+A and O+B). So we can say that A and B are dominant with respect to O.
- Because the A and B alleles produce A and B antigens, respectively, an individual with A+B or B+A will have type AB blood. The A and B alleles are thus co-dominant.
Rh Factor
Another feature of red blood cells that is important for transfusion medicine is the Rh factor. The genetics of the Rh factor is much more complicated than the ABO system, with approximately 50 different proteins involved (Dean 2005). Because a few proteins are most commonly involved, this system can be summarized as Rh+ and Rh- (Rh-positive and Rh-negative) without going into all the details.
ABO type and Rh factor are determined by separate sets of genes, meaning that Rh factor is inherited separately from ABO type. Thus the rules of Independent Assortment apply. Each ABO blood type can be associated with Rh+ or Rh-.
Probabilities of blood types in a population
The distributions of A, B, AB, and O blood types as well as Rh factors differ among populations. In a hypothetical population, blood type probabilities for all the combinations of ABO types and Rh status are summarized in the following table.
Looking at tables of numbers is difficult, so we will plot the data. The relative proportions of the eight different blood types can be shown on a bar chart, where the height of the bar is proportional to the probability of an individual having a specific ABO/Rh combination:
Visualizing the table above in this way really emphasizes the relative rarity of Rh-negative blood types in this population.
Note that the sum of all the probabilities is 1. When studying the range of probabilities for an event (here a person’s blood type), it is important that all the probabilities add up to 1, which means that we have accounted for all the possible outcomes.
sum() adds up the values passed to it. c() makes a collection of numbers, which here is the set of probabilities.
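Using the probabilities from the table above, that check looks like:

```r
# Probabilities of the eight ABO/Rh combinations (from the table above)
probs <- c(0.005, 0.270, 0.001, 0.070, 0.004, 0.250, 0.010, 0.390)

# All eight outcomes together should account for a total probability of 1
sum(probs)
```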
If a large number of people have their blood typed, the proportions of people with each blood type should be approximately equal to the overall proportions in the table and figure above.
Combining probabilities
Each of the 8 possible combinations of ABO type and Rh factor is mutually exclusive of the others, which is to say that no one person can have multiple blood types. We can therefore use the rules of addition to determine the probability that a randomly selected individual has a certain blood type. The first few are provided for you. See if you can figure out the rest (here we are just using R as a calculator).
Type O blood
Type AB blood
Type A or type B blood with any Rh factor
Any ABO type with Rh+
Not Type AB blood
These are all examples of the Addition Rule for probabilities. For mutually exclusive events, you can add the individual probabilities to get the overall probability for a set of events. Generally these take the form of A or B – like rolling a 1 or a 2 from a single die roll. Each has a probability of 1/6, so the probability of 1 or 2 is 1/6 + 1/6 = 2/6 = 1/3.
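Worked in R, a couple of the calculations above look like this (the probabilities come straight from the table):

```r
# P(type O) = P(O-) + P(O+)
0.010 + 0.390    # 0.40

# P(type A or type B, any Rh) = P(A-) + P(A+) + P(B-) + P(B+)
0.005 + 0.270 + 0.004 + 0.250    # 0.529

# The die example: P(1 or 2) on a single roll
1/6 + 1/6        # 1/3
```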
For independent events, such as ABO type and Rh factor under independent assortment, we have to multiply the two probabilities (the Multiplication Rule for probabilities). The flowchart below shows the calculation of the probabilities for O+ and O- blood types.
Starting from the left, we have all blood types. In this population, 40% have type O blood (including both Rh+ and Rh-), and 60% have all the other types combined.
In this example we will only follow the type O blood paths. Among those with type O blood, 97.5% have the Rh+ phenotype, and the remaining 2.5% have the Rh- phenotype. Because ABO type and Rh factor assort independently, the separate probabilities are multiplied to get the final probabilities. \(0.4 \times 0.025 = 0.01\) and \(0.4 \times 0.975 = 0.39\).
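The two multiplications from the flowchart:

```r
# P(type O) * P(Rh+ among type O) = P(O+)
0.4 * 0.975    # 0.39

# P(type O) * P(Rh- among type O) = P(O-)
0.4 * 0.025    # 0.01
```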
We could repeat this process for types A, B, and AB, and make a giant chart with all the possibilities.
Sampling from a population
The probabilities in the blood type table above represent what are known as “long-run” probabilities, meaning that if we sample more and more people (up to the entire population size), the proportions that we observe will match the expected probabilities.
What happens if we sample from a small group, one that is much smaller than the entire population?
Questions like this become very important when thinking about blood donation programs. Not everyone donates blood and the specific population from which blood is sampled can have a dramatic effect on the relative supply of different blood types in a community.
The function below mimics the process of blood donations from a random sample of people. Random samples (i.e., donors) are drawn from a distribution where the probability of each of the eight blood types occurring is the same as in the table above.
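A minimal sketch of that donor-sampling step (the object names here are placeholders, not necessarily those used on the page):

```r
# The eight ABO/Rh blood types and their population probabilities
blood_types <- c("A-", "A+", "AB-", "AB+", "B-", "B+", "O-", "O+")
probs <- c(0.005, 0.270, 0.001, 0.070, 0.004, 0.250, 0.010, 0.390)

# Draw 100 donors at random, weighting each type by its probability
donors <- sample(blood_types, size = 100, replace = TRUE, prob = probs)
table(donors)
```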
The gray bars are the expected counts, and the red points are the observed counts in a particular sample. Run the code a few times and examine how the plot changes each time.
What do you observe about the distribution of counts in the observed sample (red points) compared to the predicted probabilities (gray bars)? What does this tell you about the need to specifically recruit donors with rare blood types?
Go back to the code block above. Try raising the number of people sampled and see how the distribution changes.
What looks different when 1000 or 10000 people are sampled?
Goodness of fit
At this point you are gaining some intuition for how the random processes that are a normal part of sampling can lead to quite variable results each time a new sample is taken. Sometimes a sample will closely match the predicted probabilities:
- Occasionally 5 D alleles and 5 d alleles are present when 10 gametes are produced
- Occasionally the counts of blood types in a sample of 100 people agree well with the predictions
In both cases, as we increase the sample size (500, 1000, 10000, etc.), the observed counts more closely match the predicted counts.
It would be good to have a way to explicitly test whether the observed counts in a sample are a close match to what we expect. Essentially we want to know:
- Do these observations represent a random sample from a given population?
- Do these counts differ significantly from the expected counts for a given population?
These questions are basically asking the same thing. Both are answered by a statistical test for counts: a “goodness of fit” test. The name is fairly self-explanatory – we are testing how “good” the observations “fit” the expected counts.
There are many different goodness of fit tests in statistics, but the one that is used most often in genetics is the chi-squared test². That is the test we will use here.
Although the chi-squared test is fairly straightforward as we will see, the equation is a little daunting at first:
\[\chi^2 = \sum_{i=1}^n \frac{(Observed_i - Expected_i)^2}{Expected_i}\]
We can decode this “statistical sentence” piece by piece.
- \(\chi^2\) is the value that we are going to calculate. It is a single number that represents the test statistic.
- \(\Sigma\) is the shorthand for a summation, adding a set of numbers together
- \(i = 1\) tells us where to start adding
- \(n\) tells us where to stop adding. In this case \(n\) represents the number of groups we are testing.
- \((Observed_i - Expected_i)^2\) says to subtract the expected count from the observed count and then square that number, which is then divided by the expected count (\(Expected_i\)). The subscript \(i\)s correspond to the \(i = 1\) to \(n\) in the summation, accounting for the fact that we will do this calculation once for each group.
You can think of the equation as a shorthand way to tell you what to do with all the counts you have. In plain language:
- For each group:
- Subtract the expected count from the observed count, square it, and divide by the expected count
- Add up all those numbers to get the \(\chi^2\) value.
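Those two steps translate directly into R. The counts below are made-up numbers for four groups, purely to show the mechanics:

```r
# Hypothetical observed counts for four groups (made up for illustration)
observed <- c(30, 20, 28, 22)

# Null hypothesis: counts are equal across the four groups
expected <- rep(sum(observed) / 4, times = 4)   # 25 per group

# For each group: (Observed - Expected)^2 / Expected, then sum
chisq <- sum((observed - expected)^2 / expected)
chisq   # 2.72
```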
Using a chi-squared test
Before we get to the case study, we will take a slight detour and apply the chi-squared test to a data set. The data below are the number of births on each day of the week for 350 consecutive births in a hospital (not all the births on each day happened on the same day – that would be a very busy hospital).
Execute the code to save the data into R. If you are interested in what the code does:
tribble() is a function that lets us enter data line by line. Here we are making columns for Day and Births. Each row represents one pair: Sunday has 33 births, Monday has 41, and so on. mutate() just puts the days in the correct order.
We can plot these data with the code block below.
For these data, the chi-squared test asks whether the number of births per day is equal across all the days. Since we expect that births happen randomly with respect to day of week (i.e., babies should have no concept of what day it is), this seems like a reasonable null hypothesis.
Looking at the plot above, do you predict that births are evenly distributed (null hypothesis) or unevenly distributed?
If births are not evenly distributed throughout the week, what do you think might explain this pattern? What reason(s) might lead to more births on some days vs. others?
We will find out.
Thinking back to the equation for the chi-squared test, let’s figure out what we need to calculate. We have the observed counts (33, 41, 63, etc.). We need to figure out what the expected count per day is. See if you can reason out what number that is.
What is the expected count of births per day?
We can also use R to do the calculation:
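Using the stated total of 350 births, the arithmetic is:

```r
# Total births (the sum of the Births column) divided by 7 days
total_births <- 350
total_births / 7    # 50 expected births per day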
Here we are using sum() to add up the Births column of the DOB data. With 350 births spread over 7 days, we expect 50 per day. Now that we have the expected value, we can calculate \(\chi^2\).
We have done several things all at once in this code block. mutate() makes new columns. First we make a column Obs_Exp which is the observed count of Births minus Expected (50). We then use this new column to calculate the square of Obs_Exp.
Looking at the printout of the data, we can see that some days have more births than expected (Tuesday, Wednesday, Friday) and some have fewer (Sunday, Monday, Thursday, Saturday). All the squares are positive, as we expect (the squaring here is the origin of the name \(\chi^2\)).
All that remains is to divide the Obs_Exp2 by Expected:
This last step normalizes the squared deviation by the expected count. You can imagine that if there are a lot of observations, the squared deviations can get quite large.
Finally, we add up the values in the last column. These values represent each day’s contribution to the overall \(\chi^2\) value.
We have a value of 15.24. What do we do with it? Just on its own, the \(\chi^2\) statistic doesn’t mean anything. We need a value to compare it to.
The way that the chi-squared test is set up, the test value (15.24) is compared to the value that marks the cutoff for some specified percentage of a chi-squared distribution.
The figure below shows the probability plot for a chi-squared distribution with 6 degrees of freedom.³ Like the sets of blood type probabilities, this also sums to 1. But instead of adding up individual probabilities, here we calculate (integrate) the area under the curve. The area under the line is 1.
The area shaded in purple represents 5% of the area under the line. Most commonly in statistics, 5% is used as the cutoff when deciding whether a particular test is “significant” or not.⁴ This means that we accept that 5% of the time we will say that the test is significant when it is not.
If the observed data were really drawn from a uniform distribution, meaning that counts for each day are equal, then 5% of the time, the \(\chi^2\) statistic will fall somewhere in the purple area.
For our test, all we need to know is the position on the x-axis of the left-hand edge of the purple area. This value is about 12.6 (you can confirm this because the left edge is just past the thin vertical line at 12.5). 12.6 is known as the critical value for the test.
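Rather than reading the cutoff off the plot, R can compute the critical value directly with qchisq():

```r
# Chi-squared value cutting off the upper 5% of the distribution (df = 6)
qchisq(0.95, df = 6)    # about 12.59
```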
If the \(\chi^2\) statistic is greater than the critical value, then we are able to reject the null hypothesis and conclude that these observations do not follow the expected distribution.
The test statistic we calculated was 15.24, which is greater than 12.6. So here we reject the null hypothesis. It appears that births do not happen randomly throughout the week.
Shortcut
R has a built-in function to carry out the calculations for us. Although it is very informative to work through the calculations manually, it’s much easier to just let the function do the work for us. That function is chisq.test().
The output shows much the same information that we already calculated: the \(\chi^2\) statistic of 15.24 and 6 degrees of freedom. One new value is the exact P-value. All that we knew above was that P was less than 0.05. Now we know that it is exactly 0.01847.
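You can recover that P-value directly from the chi-squared distribution with pchisq():

```r
# Area under the curve to the right of the test statistic (df = 6)
pchisq(15.24, df = 6, lower.tail = FALSE)    # about 0.0185
```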
If we plot the chi-squared distribution again, we can add the observed statistic. P = 0.01847 means that 1.847% of the area under the line falls to the right of the black dot.
Take a few minutes to explore how the sample size impacts the \(\chi^2\) statistic and P-value. We can easily scale the data up and down:
Do the results make sense to you based on what you know about how sample size impacts our certainty when making inferences about a sample?
Case study: Confirming a recessive muscle mutation in mice
In the early 1990s, Ted Garland and colleagues began a long-term breeding experiment using house mice (Mus musculus). The goal of this experiment was to explore the inter-relatedness of traits involved in physiology, anatomy, brain function, immunology, behavior, and many others.
From a random, genetically diverse initial population, the research team established eight separate groups of mice. Four of the groups would be artificially selected for high levels of voluntary activity (measured using a running wheel attached to the cage) and the other four would be bred randomly. Here is a link to a book section that describes the experiment.
Because wheel-activity, the trait being selected, is heritable, the two sets of lines rapidly diverged (Figure 4).
Among other phenotypes, Garland and colleagues were interested in studying muscle physiology. Did the selected mice evolve to have different muscle masses or muscle fiber types? They were routinely weighing the calf muscles of mice as part of their research and generated plots that looked like this:
This plot shows the mass of the calf muscles plotted against body mass. In general, you expect larger mice to have larger muscles. But you do not expect that one of the lines of mice would have such small muscles compared to the others. Over and over, they found that some of the mice had calf muscles that weighed only about half as much as predicted.
They found this small muscle phenotype in three different lines, and estimated the prevalence across generations (Figure 5). These authors also found that the small muscle phenotype was favored by selection in these mice (Houle-Leroy et al. 2003).
Developing a hypothesis
Garland and colleagues examined pedigrees of mice that showed the small muscle phenotype. They found patterns like the ones below. In the pedigrees, hatching indicates that the mouse has the small muscle phenotype.
What kind of inheritance pattern do you think could explain the observed distribution of phenotypes (dominant vs. recessive, sex-linked vs. autosomal)? Briefly explain your reasoning.
Testing the hypothesis
Garland and colleagues hypothesized that the small muscle phenotype was caused by an autosomal recessive allele. They set out to determine if their hypothesis was correct by designing a test cross.
The original strain of mice that had been used to produce the 8 lines was outbred, meaning that it had extensive genetic variability.⁵ For the test cross, they chose to use a standard inbred strain of mouse: C57/Bl6 (“C57 Black 6”). This mouse did not show the small muscle phenotype.
For the other parental line, they chose all mice that had the small muscle phenotype.
Using a piece of paper, set up a Punnett square with a homozygous dominant C57/Bl6 father (MM genotype) and a homozygous recessive small muscle phenotype mother (mm genotype). What genotypes and phenotypes do you predict if the hypothesis is correct? Describe your results below.
The F1 generation should all be Mm (every combination of parental alleles yields Mm). The resulting muscle phenotypes will all be normal if the allele is recessive (a single copy of M is enough to produce the normal phenotype).
Garland and colleagues found the expected phenotypic pattern (but still didn’t know the underlying genotypes). Next they set up a second cross between the F1 mice resulting from the first cross and the small muscle phenotype parental female mice (this is called a “backcross”).
Using a piece of paper, set up a Punnett square with a heterozygous father (Mm genotype) and a homozygous recessive small muscle phenotype mother (mm genotype). What genotypes and phenotypes do you predict if the hypothesis is correct? Describe your results below.
In this cross, genotypes of half of the mice should be heterozygous (Mm) and half homozygous for the small muscle allele (mm). The phenotypic ratio will also be 1:1, because the heterozygous mice will have the normal muscle phenotype.
Results
In total, there were 404 mice in the F2 backcross generation (Hannon et al. 2008). The authors weighed the calf muscles of all the mice (called “triceps surae” here). The results are shown in Figure 6.
What is your prediction for the phenotypic ratio from just looking at the plot above? Does it appear that there is a 1:1 ratio of normal to small muscle mice?
Based on the observed muscle masses, 201 mice were classified as having the small muscle phenotype and 203 were classified as normal.
| Phenotype | Count |
|---|---|
| Normal | 203 |
| Small Muscle | 201 |
Testing the results
Now it’s your turn. In the code block below, we set up the data just like we did for the date of birth example.
Based on the observed counts, what do you predict that the test will show? Will the counts differ from a 1:1 ratio or will they not differ (the null hypothesis)?
In the code block below, use the counts above to carry out a chi-squared test for the data. You just need to add the column with the data. Look at the example above if you need a hint.
What are the \(\chi^2\) value, the degrees of freedom, and the associated P-value? What does this test tell you about the observed ratio of counts?
How do you think the results might have been different if the authors had only measured 40 mice instead of ~400? Feel free to change the numbers above and re-run the test.
Epilogue
The “mini-muscle” allele was confirmed as an autosomal recessive, just as Garland and colleagues hypothesized. After some work on the genetics of the small muscle phenotype, it was mapped to a 2.6 Mb region on chromosome 11 that encompassed about 100 genes (Hartmann et al. 2008). After several more years of research, the allele was identified as a single nucleotide polymorphism (SNP) in the Myosin Heavy Chain 4 gene (Kelly et al. 2013). Homozygous mice with two copies of this allele (Myh4-/-) fail to produce a certain type of myosin, which results in small muscles that produce less force but do not fatigue as quickly as unaffected muscles.
References
Footnotes
¹ We are happy to provide additional resources. Just ask.
² Also called the chi-square test or the \(\chi^2\) test.
³ 6 degrees of freedom because we have 7 groups (days). For this test, degrees of freedom is \(n - 1\).
⁴ For those interested, 5% is known as the alpha-level for the test.
⁵ Otherwise selective breeding would not have worked.